LoL Esports Model Building for Predicting the Support Position¶

Name(s): Andrew Zhao, Yiheng Yuan

Website Link: https://asdacdsfca.github.io/LoL_Model/

Prediction Question: Predict whether a player's position is support given their post-game data.¶

Code¶

In [1]:
import pandas as pd
import numpy as np
import os

import plotly.express as px
pd.options.plotting.backend = 'plotly'

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split

import warnings
from pandas.core.common import SettingWithCopyWarning
warnings.simplefilter(action="ignore", category=SettingWithCopyWarning)

Framing the Problem¶

  1. Prediction Question: Predict whether a player's position is support given their post-game data.

  2. Type: Classification

We chose classification because our prediction target is categorical. Classification models predict discrete output variables, such as labels or categories.

  3. Binary Classification

Since we only predict whether the position is "support," each data sample is assigned exactly one label from two mutually exclusive classes: true or false.

  4. Response Variable: position

We chose it because our prediction question asks us to predict players' roles, which are the values in the position column.

  5. Metric to Evaluate: Accuracy

We chose accuracy over other suitable metrics because we are primarily interested in how often our predictions match the actual positions overall.
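As a minimal sketch of what this metric measures, here is accuracy computed on a handful of hypothetical labels (not from our dataset), where 1 marks a support and 0 marks any other role:

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels: 1 = support, 0 = other roles
y_true = [1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]

# Accuracy = fraction of predictions that match the actual labels
print(accuracy_score(y_true, y_pred))  # 5 of 6 correct ≈ 0.8333
```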

Get the original dataset:

In [2]:
lol = pd.read_csv('2022_LoL_esports_match_data_from_OraclesElixir.csv')
lol
/opt/anaconda3/envs/dsc80/lib/python3.8/site-packages/IPython/core/interactiveshell.py:3442: DtypeWarning: Columns (2) have mixed types.Specify dtype option on import or set low_memory=False.
  exec(code_obj, self.user_global_ns, self.user_ns)
Out[2]:
gameid datacompleteness url league year split playoffs date game patch ... opp_csat15 golddiffat15 xpdiffat15 csdiffat15 killsat15 assistsat15 deathsat15 opp_killsat15 opp_assistsat15 opp_deathsat15
0 ESPORTSTMNT01_2690210 complete NaN LCK CL 2022 Spring 0 2022-01-10 07:44:08 1 12.01 ... 121.0 391.0 345.0 14.0 0.0 1.0 0.0 0.0 1.0 0.0
1 ESPORTSTMNT01_2690210 complete NaN LCK CL 2022 Spring 0 2022-01-10 07:44:08 1 12.01 ... 100.0 541.0 -275.0 -11.0 2.0 3.0 2.0 0.0 5.0 1.0
2 ESPORTSTMNT01_2690210 complete NaN LCK CL 2022 Spring 0 2022-01-10 07:44:08 1 12.01 ... 119.0 -475.0 153.0 1.0 0.0 3.0 0.0 3.0 3.0 2.0
3 ESPORTSTMNT01_2690210 complete NaN LCK CL 2022 Spring 0 2022-01-10 07:44:08 1 12.01 ... 149.0 -793.0 -1343.0 -34.0 2.0 1.0 2.0 3.0 3.0 0.0
4 ESPORTSTMNT01_2690210 complete NaN LCK CL 2022 Spring 0 2022-01-10 07:44:08 1 12.01 ... 21.0 443.0 -497.0 7.0 1.0 2.0 2.0 0.0 6.0 2.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
149227 9687-9687_game_5 partial https://lpl.qq.com/es/stats.shtml?bmid=9687 DC 2022 NaN 0 2022-12-27 12:43:43 5 12.23 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
149228 9687-9687_game_5 partial https://lpl.qq.com/es/stats.shtml?bmid=9687 DC 2022 NaN 0 2022-12-27 12:43:43 5 12.23 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
149229 9687-9687_game_5 partial https://lpl.qq.com/es/stats.shtml?bmid=9687 DC 2022 NaN 0 2022-12-27 12:43:43 5 12.23 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
149230 9687-9687_game_5 partial https://lpl.qq.com/es/stats.shtml?bmid=9687 DC 2022 NaN 0 2022-12-27 12:43:43 5 12.23 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
149231 9687-9687_game_5 partial https://lpl.qq.com/es/stats.shtml?bmid=9687 DC 2022 NaN 0 2022-12-27 12:43:43 5 12.23 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

149232 rows × 123 columns

In [3]:
lol_copy = lol.copy()

Filter out rows where position equals "team" so we are only looking at player-level data:

In [4]:
lol_cleaned = lol_copy.loc[lol['position']!='team', :]

Select all the columns that are needed for the prediction question:

In [5]:
lol_cleaned = lol_cleaned[['patch', 'champion','position','kills', 'deaths', 'assists' ,'dpm', 'damageshare', 
                'damagetakenperminute', 'vspm', 'earned gpm', 'cspm']]

Identify all the rows that contains NaN values:

In [6]:
lol_cleaned.loc[lol_cleaned['dpm'].isna()]
Out[6]:
patch champion position kills deaths assists dpm damageshare damagetakenperminute vspm earned gpm cspm
17868 12.02 Gwen top 2 3 3 NaN NaN NaN NaN 274.9249 9.3093
17869 12.02 Lee Sin jng 2 4 3 NaN NaN NaN NaN 135.8258 3.6036
17870 12.02 Lissandra mid 1 3 1 NaN NaN NaN NaN 210.1802 8.4685
17871 12.02 Jhin bot 0 4 3 NaN NaN NaN NaN 177.9880 7.4474
17872 12.02 Yuumi sup 1 4 5 NaN NaN NaN NaN 74.7748 0.9309
17873 12.02 Tryndamere top 1 3 5 NaN NaN NaN NaN 248.4985 8.4384
17874 12.02 Jarvan IV jng 3 0 14 NaN NaN NaN NaN 225.7357 5.7057
17875 12.02 Viktor mid 3 2 11 NaN NaN NaN NaN 256.0661 8.3784
17876 12.02 Kai'Sa bot 10 0 6 NaN NaN NaN NaN 430.8408 10.7808
17877 12.02 Nautilus sup 1 1 13 NaN NaN NaN NaN 119.1291 1.0511

Since there are only 10 such rows out of 124,360, we can safely drop them.

In [7]:
lol_cleaned = lol_cleaned.loc[lol_cleaned['dpm'].notna()]

Transform the position column into a binary column, 1 if the position is support, otherwise 0.

In [8]:
lol_cleaned['position'] = (lol_cleaned['position'] == 'sup').astype(int)
lol_cleaned
Out[8]:
patch champion position kills deaths assists dpm damageshare damagetakenperminute vspm earned gpm cspm
0 12.01 Renekton 0 2 3 2 552.2942 0.278784 1072.3993 0.9107 250.9282 8.0911
1 12.01 Xin Zhao 0 2 5 6 412.0841 0.208009 944.2732 1.6813 188.0210 5.1839
2 12.01 LeBlanc 0 2 2 3 499.4046 0.252086 581.6462 1.0158 208.2312 6.7601
3 12.01 Samira 0 2 4 2 389.0018 0.196358 463.8529 0.8757 239.4046 7.9159
4 12.01 Leona 1 1 5 6 128.3012 0.064763 475.0263 2.4168 101.8564 1.4711
... ... ... ... ... ... ... ... ... ... ... ... ...
149225 12.23 Jax 0 4 0 5 450.5737 0.171729 608.3352 0.7762 331.7885 9.4826
149226 12.23 Vi 0 2 4 11 201.7660 0.076899 762.7897 1.3161 211.6198 4.9944
149227 12.23 Ahri 0 6 3 8 647.4128 0.246762 553.8695 2.2610 292.4747 7.9303
149228 12.23 Varus 0 7 0 12 954.3982 0.363768 292.1035 1.5186 351.4961 8.4702
149229 12.23 Ashe 1 2 1 13 369.5163 0.140841 269.4601 4.1170 162.3510 1.3498

124350 rows × 12 columns

An overview of the cleaned dataset:

In [9]:
lol_cleaned.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 124350 entries, 0 to 149229
Data columns (total 12 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   patch                 124260 non-null  float64
 1   champion              124350 non-null  object 
 2   position              124350 non-null  int64  
 3   kills                 124350 non-null  int64  
 4   deaths                124350 non-null  int64  
 5   assists               124350 non-null  int64  
 6   dpm                   124350 non-null  float64
 7   damageshare           124350 non-null  float64
 8   damagetakenperminute  124350 non-null  float64
 9   vspm                  124350 non-null  float64
 10  earned gpm            124350 non-null  float64
 11  cspm                  124350 non-null  float64
dtypes: float64(7), int64(4), object(1)
memory usage: 12.3+ MB

Baseline Model¶

In [122]:
lol_cleaned
Out[122]:
patch champion position kills deaths assists dpm damageshare damagetakenperminute vspm earned gpm cspm
0 12.01 Renekton 0 2 3 2 552.2942 0.278784 1072.3993 0.9107 250.9282 8.0911
1 12.01 Xin Zhao 0 2 5 6 412.0841 0.208009 944.2732 1.6813 188.0210 5.1839
2 12.01 LeBlanc 0 2 2 3 499.4046 0.252086 581.6462 1.0158 208.2312 6.7601
3 12.01 Samira 0 2 4 2 389.0018 0.196358 463.8529 0.8757 239.4046 7.9159
4 12.01 Leona 1 1 5 6 128.3012 0.064763 475.0263 2.4168 101.8564 1.4711
... ... ... ... ... ... ... ... ... ... ... ... ...
149225 12.23 Jax 0 4 0 5 450.5737 0.171729 608.3352 0.7762 331.7885 9.4826
149226 12.23 Vi 0 2 4 11 201.7660 0.076899 762.7897 1.3161 211.6198 4.9944
149227 12.23 Ahri 0 6 3 8 647.4128 0.246762 553.8695 2.2610 292.4747 7.9303
149228 12.23 Varus 0 7 0 12 954.3982 0.363768 292.1035 1.5186 351.4961 8.4702
149229 12.23 Ashe 1 2 1 13 369.5163 0.140841 269.4601 4.1170 162.3510 1.3498

124350 rows × 12 columns

We use every column in lol_cleaned except position to predict whether a player's position is support¶

In [123]:
# Get training and testing datasets for X (features used to predict) and y (response)
X = lol_cleaned.drop(columns = 'position')
y = lol_cleaned['position']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
In [124]:
from sklearn.tree import DecisionTreeClassifier
In [125]:
preproc = ColumnTransformer(
    transformers = [
        ('cat_cols', OneHotEncoder(handle_unknown='ignore'), ['patch', 'champion'])
    ],
    # patch is treated as categorical here; see the note below for why we use
    # one-hot rather than ordinal encoding
    remainder='passthrough'
)
pl = Pipeline([
    ('preprocessor', preproc),
    ('decision-tree', DecisionTreeClassifier(max_depth=2))
])
pl.fit(X_train, y_train)
Out[125]:
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('cat_cols',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['patch', 'champion'])])),
                ('decision-tree', DecisionTreeClassifier(max_depth=2))])

Note: Even though the 'patch' column is ordinal, the order of patches has no meaningful relationship with the position we are trying to predict: a later patch does not make players more or less likely to play support. We therefore use one-hot encoding rather than ordinal encoding.

Since our scores on the testing and training data are very close, our model generalizes well to unseen data¶

In [126]:
# The score we got on training data
pl.score(X_train, y_train)
Out[126]:
0.9940490231820034
In [127]:
# The score we got on testing data
pl.score(X_test, y_test)
Out[127]:
0.9934701492537313

Final Model¶

We chose a Decision Tree Classifier as our model. The decision-tree algorithm recursively bisects the feature space into smaller and smaller regions, whereas Logistic Regression, by comparison, fits a single linear boundary to divide the space in two. Since we aim to answer a true/false question and the data are not linearly separable, a decision tree is the better fit.¶
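The classic illustration of this difference is the XOR problem, a tiny hypothetical dataset (not our esports data) that no single linear boundary can separate but a tree can:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# XOR: the classic non-linearly-separable toy problem
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

tree = DecisionTreeClassifier().fit(X, y)
linear = LogisticRegression().fit(X, y)

print(tree.score(X, y))    # the tree carves out all four regions
print(linear.score(X, y))  # a single line cannot separate XOR
```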

In [96]:
from sklearn.preprocessing import QuantileTransformer, FunctionTransformer
from sklearn.preprocessing import StandardScaler
In [97]:
# Define the function for function transformer
def k_a(df):
    return ((df['kills']+1)/(df['assists']+1)).to_frame()
In [98]:
# Initialize the FunctionTransformer
k_a_trans = FunctionTransformer(k_a)
In [99]:
# Get training and testing datasets for X (features used to predict) and y (response)
X = lol_cleaned.drop(columns = 'position')
y = lol_cleaned['position']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
In [101]:
# Create a pipeline for functiontransformer
k_a_pipe = Pipeline([
    ('to_k_a', k_a_trans)
])
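As a quick sanity check of the engineered feature, here is the same (kills+1)/(assists+1) ratio applied to two hypothetical stat lines; supports tend to have few kills and many assists, so their ratio is small:

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

# Same ratio as the notebook's k_a: (kills+1)/(assists+1);
# the +1 terms avoid division by zero when a player has no assists
def k_a(df):
    return ((df['kills'] + 1) / (df['assists'] + 1)).to_frame()

k_a_trans = FunctionTransformer(k_a)

# Hypothetical rows: a support-like line (1 kill, 9 assists) vs a carry-like line
toy = pd.DataFrame({'kills': [1, 9], 'assists': [9, 1]})
out = k_a_trans.transform(toy)
print(out.iloc[:, 0].tolist())  # [0.2, 5.0]
```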

Without best hyperparameters yet¶

In [103]:
# Fit and transform our training data and generate the model
preproc_final = ColumnTransformer(
    transformers = [
        ('cat_cols', OneHotEncoder(handle_unknown='ignore'), ['patch', 'champion']),
        ('quantile', QuantileTransformer(n_quantiles = 100), ['vspm', 'earned gpm']),
        # QuantileTransformer reduces the influence of outliers, so our model focuses on the majority of the data
        ('k_a', k_a_pipe, ['kills', 'assists'])
    ],
    remainder='passthrough'
)
pl_final = Pipeline([
    ('preprocessor', preproc_final),
    ('tree', DecisionTreeClassifier(max_depth=2))
])
pl_final.fit(X_train, y_train)
Out[103]:
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('cat_cols',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['patch', 'champion']),
                                                 ('quantile',
                                                  QuantileTransformer(n_quantiles=100),
                                                  ['vspm', 'earned gpm']),
                                                 ('k_a',
                                                  Pipeline(steps=[('to_k_a',
                                                                   FunctionTransformer(func=<function k_a at 0x7fc23814ea60>))]),
                                                  ['kills', 'assists'])])),
                ('tree', DecisionTreeClassifier(max_depth=2))])
In [104]:
# Score we got on training data
pl_final.score(X_train, y_train)
Out[104]:
0.9944564774506229
In [106]:
# Score we got on testing data
pl_final.score(X_test, y_test)
Out[106]:
0.9942099845599588
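The QuantileTransformer step above maps each feature onto a uniform scale, which keeps extreme values from dominating. A minimal sketch on a hypothetical skewed feature with one huge outlier:

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Hypothetical skewed feature with one huge outlier
x = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

qt = QuantileTransformer(n_quantiles=5)
r = qt.fit_transform(x).ravel()
print(r)  # the outlier maps to the top of [0, 1] instead of stretching the scale
```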

We use GridSearchCV to search for the best hyperparameters for our final model¶

In [30]:
from sklearn.model_selection import GridSearchCV
In [166]:
# Hyperparameters to try
hyperparameters = {
    'tree__max_depth':[i for i in range(1,20)],
    'tree__min_samples_split': [2,5,10],
    'tree__criterion':['gini', 'entropy']
}
In [167]:
# perform GridSearchCV to search for the best Hyperparameters
searcher = GridSearchCV(pl_final, param_grid=hyperparameters, cv=5)
searcher.fit(X_train, y_train)
Out[167]:
GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(remainder='passthrough',
                                                          transformers=[('cat_cols',
                                                                         OneHotEncoder(handle_unknown='ignore'),
                                                                         ['patch',
                                                                          'champion']),
                                                                        ('quantile',
                                                                         QuantileTransformer(n_quantiles=100),
                                                                         ['vspm',
                                                                          'earned '
                                                                          'gpm']),
                                                                        ('k_a',
                                                                         Pipeline(steps=[('to_k_a',
                                                                                          FunctionTransformer(func=<function k_a at 0x7fc23814ea60>))]),
                                                                         ['kills',
                                                                          'assists'])])),
                                       ('tree',
                                        DecisionTreeClassifier(max_depth=5,
                                                               min_samples_split=10))]),
             param_grid={'tree__criterion': ['gini', 'entropy'],
                         'tree__max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
                                             12, 13, 14, 15, 16, 17, 18, 19],
                         'tree__min_samples_split': [2, 5, 10]})

The hyperparameters we are going to use, as selected by GridSearchCV¶

In [129]:
# the best hyperparameters got by GridSearchCV
searcher.best_params_
Out[129]:
{'tree__criterion': 'gini',
 'tree__max_depth': 5,
 'tree__min_samples_split': 10}
In [136]:
# Final model with the best hyperparameters got by GridSearchCV
preproc_final = ColumnTransformer(
    transformers = [
        ('cat_cols', OneHotEncoder(handle_unknown='ignore'), ['patch', 'champion']),
        ('quantile', QuantileTransformer(n_quantiles = 100), ['vspm', 'earned gpm']),
        # QuantileTransformer reduces the influence of outliers, so our model focuses on the majority of the data
        ('k_a', k_a_pipe, ['kills', 'assists'])
    ],
    remainder='passthrough'
)
pl_final_hyp = Pipeline([
    ('preprocessor', preproc_final),
    ('tree', DecisionTreeClassifier(criterion = 'gini', max_depth = 5, min_samples_split = 10))
])
pl_final_hyp.fit(X_train, y_train)
Out[136]:
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('cat_cols',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['patch', 'champion']),
                                                 ('quantile',
                                                  QuantileTransformer(n_quantiles=100),
                                                  ['vspm', 'earned gpm']),
                                                 ('k_a',
                                                  Pipeline(steps=[('to_k_a',
                                                                   FunctionTransformer(func=<function k_a at 0x7fc23814ea60>))]),
                                                  ['kills', 'assists'])])),
                ('tree',
                 DecisionTreeClassifier(max_depth=5, min_samples_split=10))])

Since our scores on the testing and training data are very close, our final model generalizes well to unseen data¶

In [137]:
# Score we got on training data
pl_final_hyp.score(X_train, y_train)
Out[137]:
0.9953142759108747
In [151]:
y_pred = pl_final_hyp.predict(X_test)
y_pred
Out[151]:
array([0, 0, 0, ..., 0, 0, 1])
In [138]:
# Score we got on testing data
pl_final_hyp.score(X_test, y_test)
Out[138]:
0.9947568193515183

Fairness Analysis¶

Group X: players who pick a champion that League of Legends officially lists as a support champion

Group Y: players who pick a champion that League of Legends does not officially list as a support champion

reference URL: https://www.leagueoflegends.com/en-us/champions/

Champions officially listed as support by League of Legends:¶

In [152]:
support_champion = ['Alistar', 'Anivia', 'Ashe',
                    'Bard', 'Braum', 'Fiddlesticks',
                    'Heimerdinger', 'Ivern', 'Janna',
                    'Karma', 'Kayle', 'Leona',
                    'Lulu', 'Lux', 'Morgana',
                    'Nami', 'Nautilus', 'Neeko',
                    'Orianna', 'Pyke', 'Rakan', 
                    'Rell', 'Renata Glasc', 'Senna',
                    'Seraphine', 'Sona', 'Soraka',
                    'Tahm Kench', 'Taliyah', 'Taric',
                    'Thresh', 'Yuumi', 'Zilean',
                    'Zoe', 'Zyra']

Analysis:¶

In [153]:
from sklearn import metrics
In [154]:
metrics.ConfusionMatrixDisplay.from_estimator(pl_final, X_test, y_test)
Out[154]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7fc25b75f4f0>

Precision¶

In [155]:
metrics.precision_score(y_test, y_pred)
Out[155]:
0.9940274414850686
In [156]:
1 - metrics.precision_score(y_test, y_pred)
Out[156]:
0.005972558514931392

Recall¶

In [157]:
metrics.recall_score(y_test, y_pred)
Out[157]:
0.9799490770210058
In [158]:
1 - metrics.recall_score(y_test, y_pred)
Out[158]:
0.020050922978994246

A high false negative rate (equivalently, a low recall) is bad: it means the model misses actual supports.
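The relationship FNR = 1 − recall can be seen on a handful of hypothetical labels (not our data), where one of four actual supports is missed:

```python
from sklearn.metrics import recall_score

# Hypothetical labels: 4 actual supports, one of them missed by the model
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0]

recall = recall_score(y_true, y_pred)  # 3 of 4 supports found
fnr = 1 - recall                       # fraction of supports we missed
print(recall, fnr)  # 0.75 0.25
```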

In [159]:
results = X_test.copy()
results['is_support'] = results['champion'].apply(lambda x: 1 if x in support_champion else 0)
results['prediction'] = y_pred
results['position'] = y_test

(
    results
    .groupby('is_support')
    .apply(lambda x: 1 - metrics.recall_score(x['position'], x['prediction']))
    .plot(kind='bar', title='False Negative Rate by Champion Group')
)
In [160]:
results.groupby('is_support')['prediction'].mean().to_frame()
Out[160]:
prediction
is_support
0 0.013301
1 0.801937
In [161]:
(
    results
    .groupby('is_support')
    .apply(lambda x: metrics.accuracy_score(x['position'], x['prediction']))
    .rename('accuracy')
    .to_frame()
)
Out[161]:
accuracy
is_support
0 0.997980
1 0.984313

Null Hypothesis: Our model is fair. Its accuracy for players who choose a support champion and players who choose a non-support champion is roughly the same, and any differences are due to random chance.

Alternative Hypothesis: Our model is unfair. Its accuracy for players who choose a support champion is lower than its accuracy for players who choose a non-support champion.

Evaluation Metric: Accuracy

Test statistic: Difference in accuracy (support minus non-support).

Significance level: 0.05.

In [168]:
obs = results.groupby('is_support').apply(lambda x: \
                                          metrics.accuracy_score(x['position'], x['prediction'])).diff().iloc[-1]
obs
Out[168]:
-0.01366635231094171
In [163]:
diff_in_acc = []
for _ in range(1000):
    s = (
        results[['is_support', 'prediction', 'position']]
        # .to_numpy() avoids index alignment, since results keeps X_test's original index
        .assign(is_support=results.is_support.sample(frac=1.0, replace=False).to_numpy())
        .groupby('is_support')
        .apply(lambda x: metrics.accuracy_score(x['position'], x['prediction']))
        .diff()
        .iloc[-1]
    )
    
    diff_in_acc.append(s)

Share of permuted differences greater than the observed one (since our one-sided alternative says the support group's accuracy is lower, the p-value itself is the complement of this share):

In [164]:
(np.array(diff_in_acc) > obs).mean()
Out[164]:
1.0

Conclusion: Essentially all permuted differences exceed the observed difference, so the p-value for our one-sided alternative is approximately 0. We reject the null hypothesis: the difference in accuracy across the two groups is significant, and our model is less accurate for players on champions that default to support.¶
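The direction of the one-sided p-value can be illustrated with hypothetical numbers (not our permutation results): the p-value counts permuted differences at least as extreme as, i.e. less than or equal to, the observed one.

```python
import numpy as np

# Hypothetical permuted differences and an observed difference in the far left tail
perm_diffs = np.array([0.01, -0.02, 0.03, 0.00, -0.01])
obs = -0.04

# One-sided alternative "support accuracy is lower": count permuted
# differences at least as extreme (i.e., <= the observed difference)
p_value = (perm_diffs <= obs).mean()
print(p_value)  # 0.0: no permuted difference is as extreme as the observed one
```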